Aim

  • What is R
  • Why use R
  • Who uses R
  • Getting Started with R

What is R?

  • A statistical programming environment
    • based on ‘S’
    • suited to high-level data analysis
  • But offers much more than just statistics
  • Open source and cross platform
  • Extensive graphics capabilities
  • Diverse range of add-on packages
  • Active community of developers
  • Thorough documentation

http://www.r-project.org/

R screenshot

New York Times, Jan 2009

Why R?

We can do a lot in R: from statistical analysis to plotting graphs and figures, and from writing technical documentation to building a website, and much more. Let's explore.

R can facilitate Reproducible Research

Sidney Harris - New York Times

  • Statisticians at MD Anderson tried to reproduce results from a Duke paper and unintentionally unravelled a web of incompetence and skullduggery
    • as reported in the New York Times

New York Times, July 2011

  • Very entertaining talk from Keith Baggerly in Cambridge, December 2010

According to recent editorials, the reproducibility crisis is still on-going

Nature, May 2016

Reality check on reproducibility

1,500 scientists lift the lid on reproducibility

Getting started

R can be run from the command line or through a graphical user interface (GUI). On this course, we use the RStudio GUI. Let's download the required files and install them.

  1. Get latest version of R from CRAN
    • This will come with some base packages installed for general tasks.
    • We can also download additional packages from CRAN for specific tasks. As of Wed Sep 2 11:50:45 2020, 16201 packages were available.
  2. Download RStudio from here

To launch RStudio, find the RStudio icon and click it.

RStudio screenshot

Key Points

  • R offers extensive control over graphics
  • R can facilitate reproducible research

Aim

  • Understanding Basic Data Types and Data Structures in R

Data Type

To make the best of the R language, you’ll need a strong understanding of the basic data types and data structures and how to operate on them.

Data structures are very important to understand because these are the objects you will manipulate on a day-to-day basis in R. Dealing with object conversions is one of the most common sources of frustration for beginners.

Everything in R is an object.

R has 6 basic data types. (In addition to the five listed below, there is also raw which will not be discussed in this session.)

  • character
  • numeric (real or decimal)
  • integer
  • logical
  • complex

Elements of these data types may be combined to form data structures, such as atomic vectors. When we call a vector atomic, we mean that the vector only holds data of a single data type. Below are examples of atomic character vectors, numeric vectors, integer vectors, etc.

  • character: "a", "swc"
  • numeric: 2, 15.5
  • integer: 2L (the L tells R to store this as an integer)
  • logical: TRUE, FALSE
  • complex: 1+4i (complex numbers with real and imaginary parts)

R provides many functions to examine features of vectors and other objects, for example

  • class() - what kind of object is it (high-level)?
  • typeof() - what is the object’s data type (low-level)?
  • length() - how long is it? What about two dimensional objects?
  • attributes() - does it have any metadata?
# Example
x <- "biology"
typeof(x)
## [1] "character"
attributes(x)
## NULL
y <- 1:10
y
##  [1]  1  2  3  4  5  6  7  8  9 10
typeof(y)
## [1] "integer"
length(y)
## [1] 10
z <- as.numeric(y)
z
##  [1]  1  2  3  4  5  6  7  8  9 10
typeof(z)
## [1] "double"
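
The difference between class() and typeof() is easiest to see side by side. The following is a small sketch (the variable names are illustrative only): a number written without the L suffix has class "numeric" but is stored as type "double".

```r
# Compare the high-level class with the low-level storage type
whole <- 2L      # integer, because of the L suffix
real  <- 2       # no suffix, so R stores a double
truth <- TRUE
word  <- "swc"
cplx  <- 1+4i

class(whole); typeof(whole)   # "integer"   "integer"
class(real);  typeof(real)    # "numeric"   "double"
class(truth); typeof(truth)   # "logical"   "logical"
class(word);  typeof(word)    # "character" "character"
class(cplx);  typeof(cplx)    # "complex"   "complex"

# Both integers and doubles count as numeric
is.numeric(whole)
## [1] TRUE
is.numeric(real)
## [1] TRUE
```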

Data Structure

R has many data structures. These include

  • atomic vector
  • list
  • matrix
  • data frame
  • factors

Vectors

A vector is the most common and basic data structure in R and is pretty much the workhorse of R. Technically, vectors can be one of two types:

  • atomic vectors
  • lists

although the term “vector” most commonly refers to the atomic types not to lists.

The Different Vector Modes

A vector is a collection of elements that are most commonly of mode character, logical, integer or numeric.

You can create an empty vector with vector(). (By default the mode is logical. You can be more explicit as shown in the examples below.) It is more common to use direct constructors such as character(), numeric(), etc.

vector() # an empty 'logical' (the default) vector
## logical(0)
vector("character", length = 5) # a vector of mode 'character' with 5 elements
## [1] "" "" "" "" ""
character(5) # the same thing, but using the constructor directly
## [1] "" "" "" "" ""
numeric(5)   # a numeric vector with 5 elements
## [1] 0 0 0 0 0
logical(5)   # a logical vector with 5 elements
## [1] FALSE FALSE FALSE FALSE FALSE

You can also create vectors by directly specifying their content. R will then guess the appropriate mode of storage for the vector. For instance:

x <- c(1, 2, 3)

will create a vector x of mode numeric. These are the most common kind, and are treated as double precision real numbers. If you want to explicitly create integers, you need to add an L to each element (or coerce to the integer type using as.integer()).

x1 <- c(1L, 2L, 3L)

Using TRUE and FALSE will create a vector of mode logical:

y <- c(TRUE, TRUE, FALSE, FALSE)

While using quoted text will create a vector of mode character:

z <- c("Sarah", "Tracy", "Jon")

Examining Vectors

The functions typeof(), length(), class() and str() provide useful information about your vectors and R objects in general.

typeof(z)
## [1] "character"
length(z)
## [1] 3
class(z)
## [1] "character"
str(z)
##  chr [1:3] "Sarah" "Tracy" "Jon"

Adding Elements

The function c() (for combine) can also be used to add elements to a vector.

z <- c(z, "Annette")
z
## [1] "Sarah"   "Tracy"   "Jon"     "Annette"
z <- c("Greg", z)
z
## [1] "Greg"    "Sarah"   "Tracy"   "Jon"     "Annette"

Vectors from a Sequence of Numbers

You can create vectors as a sequence of numbers.

series <- 1:10
seq(10)
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(from = 1, to = 10, by = 0.1)
##  [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
## [16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
## [31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
## [46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
## [61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
## [76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
## [91] 10.0
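
A few further ways of building sequences can be useful. This is a minimal sketch using only base R functions; the variable name quarters is made up for illustration.

```r
# seq() with length.out fixes the number of elements instead of the step
quarters <- seq(from = 1, to = 10, length.out = 4)
quarters
## [1]  1  4  7 10

# The colon operator gives integers; a fractional step gives doubles
typeof(1:10)
## [1] "integer"
typeof(seq(from = 1, to = 10, by = 0.1))
## [1] "double"

# seq_len() counts from 1 to n and returns an empty vector when n is 0
seq_len(3)
## [1] 1 2 3
seq_len(0)
## integer(0)
```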

Missing Data

R supports missing data in vectors. They are represented as NA (Not Available) and can be used for all the vector types covered in this lesson:

x <- c(0.5, NA, 0.7)
x <- c(TRUE, FALSE, NA)
x <- c("a", NA, "c", "d", "e")
x <- c(1+5i, 2-3i, NA)

The function is.na() indicates the elements of the vectors that represent missing data, and the function anyNA() returns TRUE if the vector contains any missing values:

x <- c("a", NA, "c", "d", NA)
y <- c("a", "b", "c", "d", "e")
is.na(x)
## [1] FALSE  TRUE FALSE FALSE  TRUE
is.na(y)
## [1] FALSE FALSE FALSE FALSE FALSE
anyNA(x)
## [1] TRUE
anyNA(y)
## [1] FALSE
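
Because is.na() returns a logical vector, it can also be used to count or drop missing values. A short sketch with a made-up vector of heights:

```r
# A hypothetical vector of measurements with two missing values
heights <- c(182.9, NA, 169.2, NA, 164.5)

# TRUE counts as 1, so summing is.na() counts the missing values
sum(is.na(heights))
## [1] 2

# Keep only the observed values by negating the logical vector
observed <- heights[!is.na(heights)]
observed
## [1] 182.9 169.2 164.5
```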

Other Special Values

Inf is infinity. You can have either positive or negative infinity.

1/0
## [1] Inf

NaN means Not a Number. It’s an undefined value.

0/0
## [1] NaN
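
The special values can be tested for with is.finite(), is.na() and is.nan(). The sketch below shows how they relate; note that NaN is treated as missing by is.na(), but NA is not treated as NaN.

```r
# Negative infinity
-1/0
## [1] -Inf

# is.finite() is FALSE for Inf, -Inf, NaN and NA
is.finite(c(1, Inf, -Inf, NaN, NA))
## [1]  TRUE FALSE FALSE FALSE FALSE

# NaN counts as missing, but NA is not NaN
is.na(NaN)
## [1] TRUE
is.nan(NA)
## [1] FALSE
```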

What Happens When You Mix Types Inside a Vector?

R will create a resulting vector with a mode that can most easily accommodate all the elements it contains. This conversion between modes of storage is called “coercion”. When R converts the mode of storage based on its content, it is referred to as “implicit coercion”. For instance, can you guess what the following do (without running them first)?

xx <- c(1.7, "a")
xx <- c(TRUE, 2)
xx <- c("a", TRUE)

You can also control how vectors are coerced explicitly using the as.<class_name>() functions:

as.numeric("1")
## [1] 1
as.character(1:2)
## [1] "1" "2"
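
For the types covered here, implicit coercion moves towards the more flexible type, in the order logical, integer, double, character. The sketch below checks the three mixed vectors from this section with typeof(), and shows what happens when an explicit coercion cannot succeed (the name failed is illustrative).

```r
# Implicit coercion picks the most flexible type present
typeof(c(1.7, "a"))    # the number becomes the string "1.7"
## [1] "character"
typeof(c(TRUE, 2))     # TRUE becomes 1
## [1] "double"
typeof(c("a", TRUE))   # TRUE becomes the string "TRUE"
## [1] "character"

# Explicit coercion that cannot succeed gives NA (normally with a warning)
failed <- suppressWarnings(as.numeric("abc"))
failed
## [1] NA
```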

Finding Commonalities

Do you see a property that’s common to all these vectors above?

Solution
All vectors are one-dimensional and each element is of the same type.

Object Attributes

Objects can have attributes. Attributes are part of the object. These include:

  • names
  • dimnames
  • dim
  • class
  • attributes (contain metadata)

You can also glean other attribute-like information such as length (works on vectors and lists) or number of characters (for character strings).

length(1:10)
## [1] 10
nchar("Software Carpentry")
## [1] 18
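
Attributes can be inspected and set directly. The following is a small sketch with made-up objects (scores and grid) showing the names and dim attributes.

```r
# A named numeric vector (the names are illustrative)
scores <- c(alice = 90, bob = 75)
names(scores)

# attributes() returns all attributes as a list
attributes(scores)

# Adding a dim attribute turns a plain vector into a matrix
grid <- 1:6
dim(grid) <- c(2, 3)
attributes(grid)
## $dim
## [1] 2 3
```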

Matrix

In R, matrices are an extension of the numeric or character vectors. They are not a separate type of object but simply an atomic vector with dimensions: the number of rows and columns. As with atomic vectors, the elements of a matrix must be of the same data type.

m <- matrix(nrow = 2, ncol = 2)
m
##      [,1] [,2]
## [1,]   NA   NA
## [2,]   NA   NA
dim(m)
## [1] 2 2

You can check that a matrix is a vector with dimensions by using class() and typeof().

m <- matrix(c(1:3))
class(m)
## [1] "matrix" "array"
typeof(m)
## [1] "integer"

While class() shows that m is a matrix, typeof() shows that fundamentally the matrix is an integer vector.

Data types of matrix elements

Consider the following matrix:

FOURS <- matrix(
  c(4, 4, 4, 4),
  nrow = 2,
  ncol = 2)

Given that typeof(FOURS[1]) returns "double", what would you expect typeof(FOURS) to return? How do you know this is the case even without running this code?

Hint Can matrices be composed of elements of different data types?

Solution
We know that typeof(FOURS) will also return "double" since matrices are made of elements of the same data type. Note that you could do something like as.character(FOURS) if you needed the elements of FOURS as characters.

Matrices in R are filled column-wise.

m <- matrix(1:6, nrow = 2, ncol = 3)
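
Printing the matrix makes the column-wise fill visible: the values 1 to 6 run down the first column, then the second, then the third.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

# The second element of the underlying vector sits in row 2, column 1
m[2, 1]
## [1] 2
```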

Other ways to construct a matrix

m      <- 1:10
dim(m) <- c(2, 5)

This takes a vector and transforms it into a matrix with 2 rows and 5 columns.

Another way is to bind columns or rows using rbind() and cbind() (“row bind” and “column bind”, respectively).

x <- 1:3
y <- 10:12
cbind(x, y)
##      x  y
## [1,] 1 10
## [2,] 2 11
## [3,] 3 12
rbind(x, y)
##   [,1] [,2] [,3]
## x    1    2    3
## y   10   11   12

You can also use the byrow argument to specify how the matrix is filled. From R’s own documentation:

mdat <- matrix(c(1, 2, 3, 11, 12, 13),
               nrow = 2,
               ncol = 3,
               byrow = TRUE)
mdat
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]   11   12   13

Elements of a matrix can be referenced by specifying the index along each dimension (e.g. “row” and “column”) in single square brackets.

mdat[2, 3]
## [1] 13
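
Leaving one index empty selects a whole row or column. The sketch below rebuilds mdat so that it runs on its own; col2 is an illustrative name.

```r
mdat <- matrix(c(1, 2, 3, 11, 12, 13), nrow = 2, ncol = 3, byrow = TRUE)

mdat[1, ]    # first row, returned as a plain vector
## [1] 1 2 3
mdat[, 2]    # second column, also a plain vector
## [1]  2 12

# drop = FALSE keeps the result as a matrix
col2 <- mdat[, 2, drop = FALSE]
dim(col2)
## [1] 2 1
```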

List

In R, lists act as containers. Unlike atomic vectors, the contents of a list are not restricted to a single mode and can encompass any mixture of data types. Lists are sometimes called generic vectors, because the elements of a list can be any type of R object, even lists containing further lists. This property makes them fundamentally different from atomic vectors.

A list is a special type of vector. Each element can be a different type.

Create lists using list() or coerce other objects using as.list(). An empty list of the required length can be created using vector().

x <- list(1, "a", TRUE, 1+4i)
x
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "a"
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] 1+4i
x <- vector("list", length = 5) # empty list
length(x)
## [1] 5

The content of elements of a list can be retrieved by using double square brackets.

x[[1]]
## NULL

Vectors can be coerced to lists as follows:

x <- 1:10
x <- as.list(x)
length(x)
## [1] 10

Examining Lists

  1. What is the class of x[1]?
  2. What is the class of x[[1]]?

Solution

  1. class(x[1])
    ## [1] "list"
  2. class(x[[1]])
    ## [1] "integer"

Elements of a list can be named (i.e. lists can have the names attribute)

xlist <- list(a = "Karthik Ram", b = 1:10, data = head(iris))
xlist
## $a
## [1] "Karthik Ram"
## 
## $b
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $data
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
names(xlist)
## [1] "a"    "b"    "data"

Examining Named Lists

  1. What is the length of this object?
  2. What is its structure?

Solution

  1. length(xlist)
    ## [1] 3
  2. str(xlist)
    ## List of 3
    ##  $ a   : chr "Karthik Ram"
    ##  $ b   : int [1:10] 1 2 3 4 5 6 7 8 9 10
    ##  $ data:'data.frame':    6 obs. of  5 variables:
    ##   ..$ Sepal.Length: num [1:6] 5.1 4.9 4.7 4.6 5 5.4
    ##   ..$ Sepal.Width : num [1:6] 3.5 3 3.2 3.1 3.6 3.9
    ##   ..$ Petal.Length: num [1:6] 1.4 1.4 1.3 1.5 1.4 1.7
    ##   ..$ Petal.Width : num [1:6] 0.2 0.2 0.2 0.2 0.2 0.4
    ##   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1

Lists can be extremely useful inside functions. Because the functions in R are able to return only a single object, you can “staple” together lots of different kinds of results into a single object that a function can return.

A list does not print to the console like a vector. Instead, each element of the list starts on a new line.

Elements are indexed by double brackets. Single brackets will still return another list. If the elements of a list are named, they can be referenced by the $ notation (i.e. xlist$data).
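
As an illustration of returning several results from one function, here is a minimal sketch. The helper summarise_values() is hypothetical and not part of R; it simply bundles three summaries into a named list.

```r
# Hypothetical helper: bundle several summaries into one returned list
summarise_values <- function(values) {
  list(
    n     = length(values),
    mean  = mean(values),
    range = range(values)
  )
}

result <- summarise_values(c(2, 4, 6, 8))

result$mean          # access by name with $
## [1] 5
result[["range"]]    # double brackets return the element itself
## [1] 2 8
class(result["n"])   # single brackets return a (shorter) list
## [1] "list"
```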

Data Frame

A data frame is a very important data structure in R. It’s pretty much the de facto standard for most tabular data and what we use for statistics.

A data frame is a special type of list where every element of the list has the same length (i.e. a data frame is a “rectangular” list).

Data frames can have additional attributes such as rownames(), which can be useful for annotating data, like subject_id or sample_id. But most of the time they are not used.

Some additional information on data frames:

  • Usually created by read.csv() and read.table(), i.e. when importing the data into R.
  • If all columns in a data frame are of the same type, the data frame can be converted to a matrix with data.matrix() (preferred) or as.matrix(). Otherwise type coercion will be enforced and the results may not always be what you expect.
  • A new data frame can also be created with the data.frame() function.
  • Find the number of rows and columns with nrow(dat) and ncol(dat), respectively.
  • Rownames are often automatically generated and look like 1, 2, …, n. Consistency in numbering of rownames may not be honored when rows are reshuffled or subset.
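
Two of the points above can be checked quickly. The sketch below uses made-up data frames (nums and mixed) to show the effect of converting to a matrix and what happens to row names after subsetting.

```r
# All-numeric data frame: conversion to a matrix is straightforward
nums <- data.frame(x = 1:3, y = c(1.5, 2.5, 3.5))
typeof(data.matrix(nums))
## [1] "double"

# Mixed columns: as.matrix() coerces everything to character
mixed <- data.frame(id = c("a", "b", "c"), x = 1:3)
typeof(as.matrix(mixed))
## [1] "character"

# Row names are kept after subsetting, so they may no longer run 1, 2, ..., n
rownames(mixed[mixed$x > 1, ])
## [1] "2" "3"
```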

Creating Data Frames by Hand

To create data frames by hand:

dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)
dat
##    id  x  y
## 1   a  1 11
## 2   b  2 12
## 3   c  3 13
## 4   d  4 14
## 5   e  5 15
## 6   f  6 16
## 7   g  7 17
## 8   h  8 18
## 9   i  9 19
## 10  j 10 20

Useful Data Frame Functions

  • head() - shows first 6 rows
  • tail() - shows last 6 rows
  • dim() - returns the dimensions of data frame (i.e. number of rows and number of columns)
  • nrow() - number of rows
  • ncol() - number of columns
  • str() - structure of data frame - name, type and preview of data in each column
  • names() or colnames() - both show the names attribute for a data frame
  • sapply(dataframe, class) - shows the class of each column in the data frame

You can check that a data frame is actually a special list:
is.list(dat)
## [1] TRUE
class(dat)
## [1] "data.frame"
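
Several of the functions listed above can be tried on the small dat data frame. This is a brief sketch; dat is rebuilt here so the block runs on its own.

```r
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)

dim(dat)
## [1] 10  3
names(dat)
## [1] "id" "x"  "y"
head(dat, n = 3)        # n changes how many rows are shown
##   id x  y
## 1  a 1 11
## 2  b 2 12
## 3  c 3 13
sapply(dat[, c("x", "y")], class)
##         x         y 
## "integer" "integer"
```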

Because data frames are rectangular, elements of a data frame can be referenced by specifying the row and the column index in single square brackets (similar to a matrix).

dat[1, 3]
## [1] 11

As data frames are also lists, it is possible to refer to columns (which are elements of such list) using the list notation, i.e. either double square brackets or a $.

dat[["y"]]
##  [1] 11 12 13 14 15 16 17 18 19 20
dat$y
##  [1] 11 12 13 14 15 16 17 18 19 20
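
The same bracket notation accepts a logical condition in the row position, which is how rows are filtered later in this lesson. A small sketch, again rebuilding dat so that it is self-contained:

```r
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)

# Rows where x is greater than 8, all columns
dat[dat$x > 8, ]
##    id  x  y
## 9   i  9 19
## 10  j 10 20

# Single brackets with a column name return a one-column data frame,
# while double brackets return the column vector itself
class(dat["y"])
## [1] "data.frame"
class(dat[["y"]])
## [1] "integer"
```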

The following table summarizes the one-dimensional and two-dimensional data structures in R in relation to diversity of data types they can contain.

Dimensions   Homogeneous     Heterogeneous
1-D          atomic vector   list
2-D          matrix          data frame

Lists can contain elements that are themselves multi-dimensional (e.g. a list can contain data frames or other types of objects). Lists can also contain elements of any length, therefore lists do not necessarily have to be “rectangular”. However, in order for a list to qualify as a data frame, the length of each element has to be the same.

Column Types in Data Frames

Knowing that data frames are lists, can columns be of different type?

What type of structure do you expect to see when you explore the structure of the iris data frame? Hint: Use str().

Solution
The Sepal.Length, Sepal.Width, Petal.Length and Petal.Width columns are all numeric types, while Species is a Factor. Lists can have elements of different types. Since a Data Frame is just a special type of list, it can have columns of differing type (although, remember that type must be consistent within each column!).

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Key Points

  • Use typeof() and class() to find an object’s type and its class
  • Remember [row, column]

Aim

  • Reading tabular data into R
  • Basic data analysis

Steps to Basic Data Analysis

  • In this short section, we show how the data manipulation steps we have just seen can be used as part of an analysis pipeline:
  1. Reading in data
    • read.table()
    • read.csv(), read.delim()
  2. Analysis
    • Manipulating & reshaping the data
      • perhaps dealing with “missing data”
    • Any maths you like
    • Diagnostic Plots
  3. Writing out results
    • write.table()
    • write.csv()
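
The three steps can be sketched end to end on a toy example. Everything here is made up for illustration: the file contents, the column names and the scaled column; temporary files stand in for real input and output files.

```r
# 1. Reading in data (a tiny made-up tab-delimited file stands in for real data)
in_file <- tempfile(fileext = ".txt")
writeLines(c("sample\tvalue", "s1\t10", "s2\tNA", "s3\t30"), in_file)
measurements <- read.delim(in_file)

# 2. Analysis: drop the row with a missing value, then do some maths
keep <- !is.na(measurements$value)
measurements <- measurements[keep, ]
measurements$scaled <- measurements$value / max(measurements$value)

# 3. Writing out results
out_file <- tempfile(fileext = ".csv")
write.csv(measurements, file = out_file, row.names = FALSE)
read.csv(out_file)   # read the file back to check what was written
```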

A simple walk-through

  • We have data from 100 patients who have given consent for their data to be used in future studies
  • A researcher wants to undertake a study involving people that are overweight
  • We will walk through how to filter the data and write a new file with the candidates for the study

The Working Directory (wd)

  • Like many programs R has a concept of a working directory

  • It is the place where R will look for files to execute and where it will save files, by default

  • For this course we need to set the working directory to the location of the course scripts

  • In RStudio, use the mouse to browse to the directory where you saved the Course Materials

  • Session → Set Working Directory → Choose Directory…
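
The working directory can also be changed from code with setwd(). The sketch below uses tempdir() purely as a stand-in for the course folder, and restores the original location afterwards.

```r
# Remember the current working directory so it can be restored
original_wd <- getwd()

# Switch to another folder (tempdir() is just a stand-in for your course folder)
setwd(tempdir())
getwd()

# Switch back when finished
setwd(original_wd)
```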

Locate the data

Before we even start the analysis, we need to be sure of where the data are located on our hard drive

  • Functions that import data need a file location as a character vector
  • The default location is the working directory
getwd()
## [1] "/home/sangram/Documents/personal-work/0_git/learn-R/website"
  • If the file you want to read is in your working directory, you can just use the file name
list.files()
  • The file.exists function does exactly what it says on the tin!
    • a good sanity check for your code
file.exists("data/patient-info.txt")
## [1] TRUE
  • Otherwise you need the path to the file
    • you can get this using file.choose()
  • If you are unsure about specifying a file path at the command line, this online tutorial will give you hands-on practice
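
A path can be built from its parts with file.path() and checked before it is used. This is a minimal sketch; info_path is an illustrative name, and the message printed depends on whether the file is present relative to your working directory.

```r
# Build a path from its parts; file.path() inserts the separators
info_path <- file.path("data", "patient-info.txt")
info_path
## [1] "data/patient-info.txt"

# Check before reading, and report clearly if the file is not where we expect
if (file.exists(info_path)) {
  message("Found: ", info_path)
} else {
  message("Not found relative to ", getwd(), ": ", info_path)
}
```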

Read in the data

  • The data are a tab-delimited file. Each row is a record, each column is a field. Columns are separated by tabs in the text
  • We need to read in the data and assign them to an object (patients)
patients <- read.delim("data/patient-info.txt")

In the latest RStudio, there is the option to import data directly from the File menu. File -> Import Dataset -> From Csv

  • If the data are comma-separated, then use either the argument sep="," or the function read.csv()
  • You need to make sure you use the correct function
    • can you explain the output of the following lines of code?
tmp <- read.csv("data/patient-info.txt")
head(tmp)
##                           ID.Race.Sex.Smokes.Height.Weight.State.Pet.Grade.Age
## 1       AC/AH/001\tWhite\tMale\tNon-Smoker\t182.87\t76.57\tGeorgia\tDog\t2\t85
## 2      AC/AH/017\tWhite\tMale\tNon-Smoker\t179.12\t80.43\tMissouri\tDog\t2\t85
## 3 AC/AH/020\tBlack\tMale\tNon-Smoker\t169.15\t75.48\tPennsylvania\tNone\t2\t47
## 4       AC/AH/022\tWhite\tMale\tNon-Smoker\t175.66\t94.54\tFlorida\tCat\t1\t72
## 5         AC/AH/029\tWhite\tFemale\tNon-Smoker\t164.47\t71.78\tIowa\tNA\t2\t70
## 6            AC/AH/033\tNA\tFemale\tSmoker\t158.27\t69.9\tMaryland\tDog\t2\t66
  • For the full list of arguments:
?read.table
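
The effect of the separator can be reproduced on a tiny made-up file. In this sketch, tab_file, right and wrong are illustrative names; read.table() is given the header and sep arguments explicitly, which are among the defaults that read.delim() sets for you.

```r
# A tiny made-up tab-delimited file
tab_file <- tempfile(fileext = ".txt")
writeLines(c("ID\tHeight\tWeight", "p1\t182.87\t76.57", "p2\t179.12\t80.43"),
           tab_file)

# Reading with the correct separator gives three columns
right <- read.table(tab_file, header = TRUE, sep = "\t")
dim(right)
## [1] 2 3

# read.csv() looks for commas, finds none, and returns a single column
wrong <- read.csv(tab_file)
dim(wrong)
## [1] 2 1
names(wrong)
## [1] "ID.Height.Weight"
```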

Check the data

  • Always check the object to make sure the contents and dimensions are as you expect
  • R will sometimes create the object without error, but the contents may be un-usable for analysis
    • If you specify an incorrect separator, R will not be able to locate the columns in your data, and you may end up with an object with just one column
# View the first 10 rows to ensure import is OK
patients[1:10,]  
##           ID     Race    Sex     Smokes Height Weight          State  Pet Grade
## 1  AC/AH/001    White   Male Non-Smoker 182.87  76.57        Georgia  Dog     2
## 2  AC/AH/017    White   Male Non-Smoker 179.12  80.43       Missouri  Dog     2
## 3  AC/AH/020    Black   Male Non-Smoker 169.15  75.48   Pennsylvania None     2
## 4  AC/AH/022    White   Male Non-Smoker 175.66  94.54        Florida  Cat     1
## 5  AC/AH/029    White Female Non-Smoker 164.47  71.78           Iowa <NA>     2
## 6  AC/AH/033     <NA> Female     Smoker 158.27  69.90       Maryland  Dog     2
## 7  AC/AH/037    White Female Non-Smoker 161.69  68.85   Pennsylvania None     1
## 8  AC/AH/044    White Female Non-Smoker 165.84  70.44 North Carolina None     1
## 9  AC/AH/045    White   Male Non-Smoker 181.32  76.90      Louisiana  Dog     1
## 10 AC/AH/048 Hispanic   Male Non-Smoker 167.37  79.06 North Carolina None     2
##    Age
## 1   85
## 2   85
## 3   47
## 4   72
## 5   70
## 6   66
## 7   24
## 8   68
## 9   86
## 10  63
  • or use the View() function to get a display of the data in RStudio:
View(patients)
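
Besides looking at the rows, a few summary functions give a quick check of an imported object. The sketch below uses a made-up data frame (checks) in place of a real import.

```r
# A made-up data frame standing in for an imported file
checks <- data.frame(
  ID     = c("p1", "p2", "p3"),
  Height = c(182.87, NA, 169.15),
  Pet    = c("Dog", NA, NA)
)

str(checks)               # column names, types and a preview
summary(checks$Height)    # numeric summary, including the NA count
colSums(is.na(checks))    # number of missing values per column
##     ID Height    Pet 
##      0      1      2
```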

Understanding the object

  • Once we have read the data successfully, we can start to interact with it
  • The object we have created is a data frame:
class(patients)
## [1] "data.frame"
  • We can query the dimensions:
ncol(patients)
## [1] 10
nrow(patients)
## [1] 100
dim(patients)
## [1] 100  10
  • The names of the columns are automatically assigned:
colnames(patients)
##  [1] "ID"     "Race"   "Sex"    "Smokes" "Height" "Weight" "State"  "Pet"   
##  [9] "Grade"  "Age"
  • We can use any of these names to access a particular column:
    • and create a vector
    • TOP TIP: type the name of the object and hit TAB: you can select the column from the drop-down list!
patients$ID
##   [1] "AC/AH/001" "AC/AH/017" "AC/AH/020" "AC/AH/022" "AC/AH/029" "AC/AH/033"
##   [7] "AC/AH/037" "AC/AH/044" "AC/AH/045" "AC/AH/048" "AC/AH/049" "AC/AH/050"
##  [13] "AC/AH/052" "AC/AH/053" "AC/AH/057" "AC/AH/061" "AC/AH/063" "AC/AH/076"
##  [19] "AC/AH/077" "AC/AH/086" "AC/AH/089" "AC/AH/100" "AC/AH/104" "AC/AH/112"
##  [25] "AC/AH/113" "AC/AH/114" "AC/AH/115" "AC/AH/127" "AC/AH/133" "AC/AH/150"
##  [31] "AC/AH/154" "AC/AH/156" "AC/AH/159" "AC/AH/160" "AC/AH/164" "AC/AH/171"
##  [37] "AC/AH/176" "AC/AH/180" "AC/AH/185" "AC/AH/186" "AC/AH/192" "AC/AH/198"
##  [43] "AC/AH/207" "AC/AH/208" "AC/AH/210" "AC/AH/211" "AC/AH/213" "AC/AH/219"
##  [49] "AC/AH/220" "AC/AH/221" "AC/AH/225" "AC/AH/233" "AC/AH/241" "AC/AH/244"
##  [55] "AC/AH/248" "AC/AH/249" "AC/SG/002" "AC/SG/003" "AC/SG/008" "AC/SG/009"
##  [61] "AC/SG/010" "AC/SG/015" "AC/SG/016" "AC/SG/046" "AC/SG/055" "AC/SG/056"
##  [67] "AC/SG/064" "AC/SG/065" "AC/SG/067" "AC/SG/068" "AC/SG/072" "AC/SG/074"
##  [73] "AC/SG/084" "AC/SG/095" "AC/SG/099" "AC/SG/101" "AC/SG/107" "AC/SG/116"
##  [79] "AC/SG/121" "AC/SG/122" "AC/SG/123" "AC/SG/134" "AC/SG/139" "AC/SG/142"
##  [85] "AC/SG/155" "AC/SG/165" "AC/SG/167" "AC/SG/172" "AC/SG/173" "AC/SG/179"
##  [91] "AC/SG/181" "AC/SG/182" "AC/SG/191" "AC/SG/193" "AC/SG/194" "AC/SG/197"
##  [97] "AC/SG/204" "AC/SG/216" "AC/SG/217" "AC/SG/234"

Word of warning

“Like families, tidy datasets are all alike but every messy dataset is messy in its own way” - Hadley Wickham (RStudio chief scientist and author of dplyr, ggplot2 and others)

You will make your life a lot easier if you keep your data tidy and organised. Before blaming R, consider if your data are in a suitable form for analysis. The more manual manipulation you have done on the data (highlighting, formulas, copy-and-pasting), the less happy R is going to be to read it. Here are some useful links on some common pitfalls and how to avoid them

Handling missing values

  • The data frame contains some NA values, which means the values are missing – a common occurrence in real data collection
  • NA is a special value that can be present in objects of any type (logical, character, numeric etc)
  • NA is not the same as NULL:
    • NULL is an empty R object.
    • NA is one missing value within an R object (like a data frame or a vector)
  • Often R functions will handle NAs gracefully:
length(patients$Height)
## [1] 100
mean(patients$Height)
## [1] NA
  • However, sometimes we have to tell the functions what to do with them.
  • R has some built-in functions for dealing with NAs, and functions often have their own arguments (like na.rm) for handling them:
    • annoyingly, different functions have different argument names to change their behavior with regards to NA values. Always check the documentation
mean(patients$Height, na.rm = TRUE)
## [1] 167.4969
mean(na.omit(patients$Height))
## [1] 167.4969
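
To illustrate the points above, the sketch below contrasts NULL with NA and shows three base R functions that each use a different argument name for handling missing values. The vectors x and y are made up.

```r
x <- c(1, NA, 3)
y <- c(2, 4, 6)

length(NULL)   # NULL is an empty object
## [1] 0
length(NA)     # NA is one (missing) value
## [1] 1

sum(x, na.rm = TRUE)                      # na.rm for sum(), mean(), sd(), ...
## [1] 4
cor(x, y, use = "complete.obs")           # use for cor()
## [1] 1
table(c("a", NA, "a"), useNA = "ifany")   # useNA for table()
```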

Analysis

  • Our analysis involves identifying patients with extreme BMI
    • we will define this as being more than two standard deviations above the mean
# Create an index of results:
BMI <- (patients$Weight)/((patients$Height/100)^2)
upper.limit <- mean(BMI,na.rm = TRUE) + 2*sd(BMI,na.rm = TRUE)
upper.limit
## [1] 30.9533
  • We can plot a simple chart of the BMI values
    • add a horizontal line to indicate the cut-off
    • plotting will be covered in detail shortly
plot(BMI)
# Add a horizontal line:
abline(h=upper.limit) 

  • It is also useful to save the variable we have computed as a new column in the data frame
round(BMI,1)
##   [1] 22.9 25.1 26.4 30.6 26.5 27.9 26.3 25.6 23.4 28.2 28.2   NA 30.0 27.9 24.5
##  [16] 22.0 25.6 31.5 23.8   NA 23.5 26.7 31.4   NA 24.6   NA 24.8 29.2   NA 24.1
##  [31] 25.1 28.0 29.4 28.2 23.6 26.4   NA 25.0 27.7 27.0 25.6 26.7 24.5 26.1 23.1
##  [46] 28.2 26.9   NA 25.4 25.9   NA 24.8 28.2   NA 30.4 26.8 26.0 25.2 26.9 31.7
##  [61] 25.6   NA 26.7 27.8 28.4   NA 31.5 27.0 30.0 26.5 25.2   NA 26.7 25.8   NA
##  [76] 27.6 29.1 26.6 26.6 26.9 27.6 26.4 27.8   NA 27.8 25.8 27.7 28.7 24.2 24.6
##  [91] 28.3 24.8 27.8 21.4 28.0 26.0 26.2 26.4 27.7   NA
patients$BMI <- round(BMI,1)
head(patients)
##          ID  Race    Sex     Smokes Height Weight        State  Pet Grade Age
## 1 AC/AH/001 White   Male Non-Smoker 182.87  76.57      Georgia  Dog     2  85
## 2 AC/AH/017 White   Male Non-Smoker 179.12  80.43     Missouri  Dog     2  85
## 3 AC/AH/020 Black   Male Non-Smoker 169.15  75.48 Pennsylvania None     2  47
## 4 AC/AH/022 White   Male Non-Smoker 175.66  94.54      Florida  Cat     1  72
## 5 AC/AH/029 White Female Non-Smoker 164.47  71.78         Iowa <NA>     2  70
## 6 AC/AH/033  <NA> Female     Smoker 158.27  69.90     Maryland  Dog     2  66
##    BMI
## 1 22.9
## 2 25.1
## 3 26.4
## 4 30.6
## 5 26.5
## 6 27.9
  • To actually select the candidates we can use a logical expression to test the values of the BMI vector being greater than the upper limit
    • if the second line looks a bit weird, remember that <- is doing an assignment. The value we are assigning to our new variable is the logical (TRUE or FALSE) vector given by testing each item in BMI against the upper.limit
BMI > upper.limit
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE    NA
##  [13] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE    NA FALSE FALSE  TRUE    NA
##  [25] FALSE    NA FALSE FALSE    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37]    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE    NA
##  [49] FALSE FALSE    NA FALSE FALSE    NA FALSE FALSE FALSE FALSE FALSE  TRUE
##  [61] FALSE    NA FALSE FALSE FALSE    NA  TRUE FALSE FALSE FALSE FALSE    NA
##  [73] FALSE FALSE    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE    NA
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE    NA
candidates <- BMI > upper.limit

We have seen that a logical vector can be used to subset a data frame

  • However, in our case the result looks a bit funny
  • Can you think why this might be?
patients[candidates,]
##              ID  Race  Sex     Smokes Height Weight     State  Pet Grade Age
## NA         <NA>  <NA> <NA>       <NA>     NA     NA      <NA> <NA>    NA  NA
## 18    AC/AH/076 White Male Non-Smoker 176.22  97.67 Louisiana  Cat     2  26
## NA.1       <NA>  <NA> <NA>       <NA>     NA     NA      <NA> <NA>    NA  NA
## 23    AC/AH/104 White Male     Smoker 169.85  90.63  Kentucky None     1  87
## NA.2       <NA>  <NA> <NA>       <NA>     NA     NA      <NA> <NA>    NA  NA
## NA.3       <NA>  <NA> <NA>       <NA>     NA     NA      <NA> <NA>    NA  NA
## NA.4       <NA>  <NA> <NA>       <NA>     NA     NA      <NA> <NA>    NA  NA
## NA.5       <NA>  <NA> <NA>       <NA>     NA     NA      <NA> <NA>    NA  NA
## NA.6       <NA>  <NA> <NA>       <NA>     NA     NA      <NA> <NA>    NA  NA
## NA.7       <NA>  <NA> <NA>       <NA>     NA     NA      <NA> <NA>    NA  NA
## NA.8       <NA>  <NA> <NA>       <NA>     NA     NA      <NA> <NA>    NA  NA
## 60    AC/SG/009 White Male Non-Smoker 166.84  88.25   Vermont  Dog     1  43
## NA.9       <NA>  <NA> <NA>       <NA>     NA     NA      <NA> <NA>    NA  NA
## NA.10      <NA>  <NA> <NA>       <NA>     NA     NA      <NA> <NA>    NA  NA
## 67    AC/SG/064 White Male Non-Smoker 169.16  90.08  Illinois  Cat     2  44
## NA.11      <NA>  <NA> <NA>       <NA>     NA     NA      <NA> <NA>    NA  NA
## NA.12      <NA>  <NA> <NA>       <NA>     NA     NA      <NA> <NA>    NA  NA
## NA.13      <NA>  <NA> <NA>       <NA>     NA     NA      <NA> <NA>    NA  NA
## NA.14      <NA>  <NA> <NA>       <NA>     NA     NA      <NA> <NA>    NA  NA
##        BMI
## NA      NA
## 18    31.5
## NA.1    NA
## 23    31.4
## NA.2    NA
## NA.3    NA
## NA.4    NA
## NA.5    NA
## NA.6    NA
## NA.7    NA
## NA.8    NA
## 60    31.7
## NA.9    NA
## NA.10   NA
## 67    31.5
## NA.11   NA
## NA.12   NA
## NA.13   NA
## NA.14   NA

The which function will take a logical vector and return the indices of the TRUE values

  • This can then be used to subset the data frame
which(BMI > upper.limit)
## [1] 18 23 60 67
candidates <- which(BMI > upper.limit)
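
The reason for the odd-looking rows is that an NA in a logical index produces a row of NAs rather than being skipped. A small sketch with made-up data (scores and over are illustrative names) shows the behaviour and two ways around it.

```r
# Made-up scores with one missing value
scores <- data.frame(id = c("a", "b", "c", "d"), value = c(5, NA, 12, 20))
over <- scores$value > 10
over
## [1] FALSE    NA  TRUE  TRUE

nrow(scores[over, ])          # the NA produces an extra all-NA row
## [1] 3
nrow(scores[which(over), ])   # which() keeps only the TRUE positions
## [1] 2

# An equivalent filter without which()
nrow(scores[!is.na(over) & over, ])
## [1] 2
```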

Outputting the results

  • We write out a data frame of candidates (patients with BMI more than two standard deviations above the mean) as a ‘comma separated values’ text file (CSV):
write.csv(patients[candidates,], file="outputs/selectedSamples.csv")
  • The output file is directly-readable by Excel
  • It’s often helpful to double check where the data has been saved. Use the get working directory function:
getwd()      # print working directory
list.files() # list files in working directory
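
Two details are worth knowing when writing files: write.csv() does not create missing folders, and by default it adds a column of row names. The sketch below uses made-up results (selected) and a temporary folder standing in for outputs/.

```r
# Made-up results and a temporary location standing in for "outputs/"
selected <- data.frame(ID = c("p18", "p23"), BMI = c(31.5, 31.4))
out_dir  <- file.path(tempdir(), "outputs")

# write.csv() does not create folders, so make sure the folder exists first
if (!dir.exists(out_dir)) {
  dir.create(out_dir)
}

out_file <- file.path(out_dir, "selectedSamples.csv")

# row.names = FALSE leaves out the extra column of row numbers
write.csv(selected, file = out_file, row.names = FALSE)
file.exists(out_file)
## [1] TRUE
```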

To recap, the set of R commands we have used is:

patients <- read.delim("data/patient-info.txt")
BMI <- (patients$Weight)/((patients$Height/100)^2)
upper.limit <- mean(BMI,na.rm = TRUE) + 2*sd(BMI,na.rm = TRUE)
plot(BMI)
# Add a horizontal line:
abline(h=upper.limit) 

patients$BMI <- round(BMI,1)
candidates <- which(BMI > upper.limit)
write.csv(patients[candidates,], file="outputs/selectedSamples.csv")

Exercise 3

  • A separate study is looking for patients that are underweight and also smoke:
    • Modify the condition in our previous code to find these patients
    • e.g. having a BMI that is more than 2 standard deviations below the mean BMI
    • Write out a results file of the samples that match these criteria, and open it in a spreadsheet program

Key Points

  • Import data using read.table()
 

Created and Maintained by Sangram Keshari Sahu
Rmarkdown Template used from Rmdplates package
Licensed under CC-BY 4.0
Source Code At GitHub